Introduction

NYC is a hub of diversity, yet there is one similarity that 8.5 million people shares – everyone eats. Over 26,000 eating establishments in the city (according to the NYC Department of Health and Mental Hygiene or DOHMH) thrives here because New Yorkers dine out over 58% of their lunches or dinners and spend $46.14 on average for each meal per person (Zagat Dining Trends Survey 2018). For these customers, ourselves included, the food that we purchase so frequently and at such high costs is a major factor influencing our health. To help people stay informed about food safety and the potential health risks that eating out poses, we conducted our research about restaurants in NYC based on DOHMH’s source data hosted on Open Data.

\(Team\) \(Members:\) Cindy Wu, Julie Yu, Lisa Kim, Selina Tang

Cindy- data description, data quality, time series overview
Julie- grade distributions and analysis
Lisa- violation distributions and analysis
Selina- restaurants case by case

Dataset: NYC Restaurant Inspection Data

Data Quality

Overview: Missing Data and data cleaning

click here to see interactive visna plots made with D3

We first explore the distribution of missing data in the original dataset. As shown in the chart and plot, the most common missing pattern in the original data is Grade on its own with over 50% missing data, followed by only around 5% in scores and minor missing percentages in violation details.

## # A tibble: 10 x 4
##    variable              type       value percent_missing
##    <chr>                 <chr>      <dbl>           <dbl>
##  1 GRADE                 character 188585          50.2  
##  2 SCORE                 integer    20466           5.45 
##  3 VIOLATION.DESCRIPTION character   7106           1.89 
##  4 VIOLATION.CODE        character   6010           1.60 
##  5 INSPECTION.TYPE       character   1108           0.295
##  6 DBA                   character    419           0.112
##  7 BORO                  character      0           0    
##  8 ZIPCODE               character      0           0    
##  9 CUISINE.DESCRIPTION   character      0           0    
## 10 INSPECTION.DATE       character      0           0

From the official grading system in data description, we learned that the grades are converted from scores accroding to a specific scheme for every ‘gradable’ inspection, and only inspections of certain types are eligble to receive a grade. Due to the large amount of missing values in Grade, we pay more attention to scores and violation details since they are good and direct reflections of inspection results and contain much fewer missing data.

In addition to grades, the data description also states that some scores are missing because there are new restarants that have yet to be inspected, and they are marked by “inspection.date = 01/01/1900”. Therefore, we removed rows with new restaurants’ and looked at missing values again.

The second chart shows the missing value percentages for each variable after removing new restaurants. We see a decrease in missing scores and for our exploratory analysis, we removed all missing values in scores.

## [1] "Number of new restaurants: 1108"
## # A tibble: 10 x 4
##    variable              type       value percent_missing
##    <chr>                 <chr>      <dbl>           <dbl>
##  1 GRADE                 character 187478           50.1 
##  2 SCORE                 integer    19358            5.17
##  3 VIOLATION.DESCRIPTION character   5998            1.60
##  4 VIOLATION.CODE        character   4902            1.31
##  5 DBA                   character      0            0   
##  6 BORO                  character      0            0   
##  7 ZIPCODE               character      0            0   
##  8 CUISINE.DESCRIPTION   character      0            0   
##  9 INSPECTION.DATE       character      0            0   
## 10 INSPECTION.TYPE       character      0            0

The third chart summarizes the final data we used for the report. Although there are still many missing values in grades, we do not consider it bad data quality since they are intentionally left blank.The overall quality of the dataset seems quite promising at this point.

## # A tibble: 10 x 4
##    variable              type       value percent_missing
##    <chr>                 <chr>      <dbl>           <dbl>
##  1 GRADE                 character 168130          47.4  
##  2 VIOLATION.DESCRIPTION character   1934           0.545
##  3 VIOLATION.CODE        character   1492           0.420
##  4 DBA                   character      0           0    
##  5 BORO                  character      0           0    
##  6 ZIPCODE               character      0           0    
##  7 CUISINE.DESCRIPTION   character      0           0    
##  8 INSPECTION.DATE       character      0           0    
##  9 SCORE                 integer        0           0    
## 10 INSPECTION.TYPE       character      0           0

A Closer Look: borough and inspection types

To ensure data quality and validify our further analysis, we inspect the data more closely based on two variables we are particularly interested in- boroughs and inspection types.

# By borough
percent_missing1 <- df_1 %>% group_by(BORO) %>% 
  summarize(num_Restaurants = n(), num_Missing_Score = sum(is.na(`SCORE`))) %>% 
  mutate(percent_Missing_Score = round(num_Missing_Score/num_Restaurants, 2)) %>% 
  arrange(-percent_Missing_Score)

p1<- ggplot(percent_missing1)+
  geom_line(aes(x= BORO, y =percent_Missing_Score, group=1))+
  geom_point(aes(x= BORO, y =percent_Missing_Score),color = "hotpink")+
  ylim(0,0.15)+
  ggtitle("Missing data percentage by borough")
p1 <- ggplotly(p1)
p1
# By inspection type
percent_missing <- df_1 %>% group_by(INSPECTION.TYPE) %>% 
  summarize(num_Restaurants = n(), num_Missing_Score = sum(is.na(`SCORE`))) %>% 
  mutate(percent_Missing_Score = round(num_Missing_Score/num_Restaurants, 2)) %>% 
  arrange(-percent_Missing_Score)

p2 <- ggplot(percent_missing)+
  geom_line(aes(x= INSPECTION.TYPE, y =percent_Missing_Score, group=1))+
  geom_point(aes(x= INSPECTION.TYPE, y =percent_Missing_Score),color = "blue")+
  ggtitle("Missing data percentage by Inspection Type")
p2 <- ggplotly(p2)
p2

Each borough has very similar percentage in missing values so it is fair to conclude that the data qulity of our dataset is identical across the city.

Exploratory Data Analysis

The following time series plot gives a general overview of NYC restaurant cleanliness over the past five year.

df <- df[!duplicated(df[c('DBA', 'INSPECTION.DATE', 'GRADE')]),]
df$INSPECTION.DATE <- as.Date(as.character(df$INSPECTION.DATE),format="%m/%d/%Y")
ts <- xts(df$SCORE,df$INSPECTION.DATE)
quarterly <- apply.quarterly(ts, FUN =mean)
ts_df<- data.frame(date=index(ts),score = ts)
quarterly_df <- data.frame(date = index(quarterly), quarter_avg = quarterly)

p<-plot_ly(quarterly_df[quarterly_df$date> "2013-05-16" & quarterly_df$date < "2018-03-31" ,])%>% 
  add_lines( x=~date, y=~quarter_avg,
        type = 'scatter',
        mode = 'lines+markers',
        hoverinfo = 'text',
        text = ~paste("Quarter Date:",date,"<br> Average Score: ",round(quarter_avg)))%>%
  add_trace(x = ~date,y = ~quarter_avg, mode = 'markers',color=I("orange"),marker = list(size = 8))%>%
        layout(title = "Quarter Average Score",showlegend = FALSE)
p

There is a clear pattern that the third quarter (Jul - Sept) has the highest average score each year. Thinking in terms of season and weather, since July, August and September are typically the hottest months in NYC, we can infer that there are more violations during the summer. In the next sections, we go into more details about inspection results and look for more interesting patterns from two main perspectives- grade and violations.

Grades

  • Missing Grades *

In this section, we will be focusing on the grades the restaurants in NYC received. Like before, we exclued those with dates = 1/1/1900, which indicated that they’re new restaurants yet to be inspected. We also excluded rows without a score since those without a score would also have no grade and that grades are determined based on scores. And we’re keeping missing grades as of now for further investigation later on.

According to the manual of this data set, only inspections of the type in the following four categories are gradable: - Cycle Inspection/Initial Inspection - Cycle Inspection/Re-Inspection - Pre-Permit (Operational)/Initial Inspection - Pre-Permit (Operational)/Re-Inspection) Therefore, in this section, we filtered out restaurants that have inspection types not from the above four categories. Also, for the following studies, we will exclude grades that are “Not Yet Graded”, “P”, or “Z” which indicate pending grades under different circumstances.

In addition, a restaurant can receive multiple violtaions within the same day of an inspection, resulting in multiple rows in the data set. Therefore, the duplicated grades given to a particular restaurant on the same day should be only counted once.

We first looked at the distribution of grades for A, B, and C and excluding missing values.

ggplot(filter(df,grade=='A'|grade=='B'|grade=='C'), aes(x=grade)) + geom_bar() + labs(title="Restaurant Grades")

We then looked at how many missing values there are in grades after eliminating all the duplicated entries.

skimr::skim(df$grade)
## Skim summary statistics
## 
## Variable type: character 
##  variable missing complete      n min max empty n_unique
##  df$grade       0   118571 118571   0  14 41720        6

According to the summary statistics above, we see that there are many inspections that have no grades (ie empty values). We then checked the manual to try to figure out why there are missing values and ways to deal with them. According to the manual, for restaurants that didn’t receive a score less than 14 (which corresponded to an A) during their initial inspection, they would be given another chance for re-inspection. And a grade would not be given until re-inspection. In order to capture this information and conduct some futher studies, we inputed the missing grades by converting scores to grades according to NYC Department of Health and Mental Hygiene’s (DOHMH) grading rules.

convert_to_grade <- function(x){
  if (x < 14){
    return("A")
  }
  else if(x > 28){
    return("C")
  }
  else {
    return("B")
  }
}
df$imp_grade <- sapply(df$score, convert_to_grade)

We’re now interested in whether there is a difference in the grade distribution between simply looking at grades given by NYC DOHMH as they are and grades that include information of initial inspections.

p1 <- ggplot(filter(df,grade=='A'|grade=='B'|grade=='C'), aes(x=grade)) + geom_bar() + labs(title="Restaurant Grades")
p2 <- ggplot(filter(df,(imp_grade=='A'|imp_grade=='B'|imp_grade=='C')&(inspection_type=="Cycle Inspection / Initial Inspection" | inspection_type== "Pre-permit (Operational) / Initial Inspection")), aes(x=imp_grade)) + geom_bar() + labs(title="Restaurant Grades at Initial Inspections",x="grade")
grid.arrange(p1, p2, nrow = 1)

From the above graphs, we can see that the distributions are quite different. If we had only simply looked at the grades given as they were (graph on the left), the percentage of A restaurants would be pretty high. However, looking at the graph on the right tells us that actually many restaurants did not receive an A at their initial inspection.

Since the above graph on the left consists of both initial inspection and re-inspection grades without missing grades while the graph on the right consists of only initial inspection with imputed grades, it could be hard to make a direct comparison. Therefore, we introduced two variables explained below and made a mosaic plot for smoother transitions from one to another for easier comparison, which also incoporates information about the different sample sizes.

  • imputation: yes = missing grades are imputed as discussed previously; no = missing grades are excluded
  • inspection: initial = initial insepction; re = re-inspection; both = initial and re-inspection combined

Note that, non-imputed data are just a subset of the imputed data and also initial inspection and re-inspection combined would belong to the “both” category for inspection. We did this because looking at the “both” category is the natural way we’d first look at the data and it’d be hard to imagine this category and doing the math in our head ourselves if we’re to only give the “initial” and “re” categories.

df1 <- filter(df, grade=='A'|grade=='B'|grade=='C') %>% select(c('inspection_type','grade')) %>% group_by(grade) %>% summarise(Freq=n())
df1$imputation <- c("no")
df1$inspection <- c("both")
colnames(df1) <- c("grade","Freq","imputation","inspection")
#summarise(all_insp_df,sum(Freq))

df2 <- filter(df, grade=='A'|grade=='B'|grade=='C') %>% select(c('inspection_type','grade')) %>% filter(inspection_type=="Cycle Inspection / Initial Inspection" | inspection_type== "Pre-permit (Operational) / Initial Inspection") %>% group_by(grade) %>% summarise(Freq=n())
df2$imputation <- c("no")
df2$inspection <- c("initial")
colnames(df2) <- c("grade","Freq","imputation","inspection")
#summarise(all_insp_df,sum(Freq))

df3 <- filter(df, grade=='A'|grade=='B'|grade=='C') %>% select(c('inspection_type','grade')) %>% filter(inspection_type=="Cycle Inspection / Re-inspection" | inspection_type== "Pre-permit (Operational) / Re-inspection") %>% group_by(grade) %>% summarise(Freq=n())
df3$imputation <- c("no")
df3$inspection <- c("re")
colnames(df3) <- c("grade","Freq","imputation","inspection")
#summarise(all_insp_df,sum(Freq))

df4 <- df %>% select(c('inspection_type','imp_grade')) %>% group_by(imp_grade) %>% summarise(Freq=n())
df4$imputation <- c("yes")
df4$inspection <- c("both")
colnames(df4) <- c("grade","Freq","imputation","inspection")
#summarise(imp_insp_df,sum(Freq))

df5 <- df %>% select(c('inspection_type','imp_grade')) %>% filter(inspection_type=="Cycle Inspection / Initial Inspection" | inspection_type== "Pre-permit (Operational) / Initial Inspection") %>% group_by(imp_grade) %>% summarise(Freq=n())
df5$imputation <- c("yes")
df5$inspection <- c("initial")
colnames(df5) <- c("grade","Freq","imputation","inspection")
#summarise(init_insp_df,sum(Freq))

df6 <- df %>% select(c('inspection_type','imp_grade')) %>% filter(inspection_type=="Cycle Inspection / Re-inspection" | inspection_type== "Pre-permit (Operational) / Re-inspection") %>% group_by(imp_grade) %>% summarise(Freq=n())
df6$imputation <- c("yes")
df6$inspection <- c("re")
colnames(df6) <- c("grade","Freq","imputation","inspection")
#summarise(re_insp_df,sum(Freq))
df7 <- rbind(df1,df2,df3,df4,df5,df6)
df7 <- droplevels(df7)
df7$imputation <- as.factor(df7$imputation)
df7$inspection <- as.factor(df7$inspection)
vcd::mosaic(grade ~inspection+imputation,df7,main="Grade Distribution vs Inspection Type vs Imputation")

Note that when fixing inspection=re, there is slight difference before and after imputation when there shouldn’t be because, according to the manual, it’s possible that a record of the grade is simply missing in the data even if the grade was issued. And as we’d expect, without imputation (filling in missing grades), the initial inspections only have A grades. Fixing inspection=initial, we see a big difference when we impute the data – there are about 40% of restaurants that actually did not receive an A during their first inspection. As the way the data was given to us (“both”+“none” categories), if we simply ignore the missing values, we’d assume that a high percentage of restaurants in NYC have an A. However, as shown in the graph, this is not true if we take into account the initial inspection grades (ie “yes” category). So this tells us that we shouldn’t just trust that almost all restaurants in NYC have been always clean and healthy. But also, we can see that the enforcement of inspections can be effective in encouraging the restaurants to improve their cleanliness and health levels.

** Grades vs Restaurant Types **

We now shift our focus to examining if certain types of restaurants are cleaner and healthier than ther others. Since New York City requires restaurant to post their grades and the customers are often more familiar with this grading system, we decided to look at grades instead of scores. Above are tables showing top 20 restaurant types with the highest percentage of receiving an A (count of A’s/total counts of A+B+C within its own category) and a C. And for this study, we decided to use grades from both initial insepctions and re-inspections excluding missing grades since these are the actual grades the restaurants will post.

df_cuisine <- df[,c('name','cuisine','grade')]

df_cuisine <- df_cuisine %>% filter(grade=='A'|grade=='B'|grade=='C')

df_cuisine <- df_cuisine %>% group_by(cuisine) %>% summarise(count_A=sum(grade == "A", na.rm = T),count_B=sum(grade == "B", na.rm = T),count_C=sum(grade == "C", na.rm = T))

df_cuisine <- df_cuisine %>% group_by(cuisine) %>% mutate(total_count=count_A+count_B+count_C)

df_cuisine <- df_cuisine %>% group_by(cuisine) %>% mutate(percentage_A=count_A/total_count)

df_cuisine <- df_cuisine %>% group_by(cuisine) %>% mutate(percentage_B=count_B/total_count)

df_cuisine <- df_cuisine %>% group_by(cuisine) %>% mutate(percentage_C=count_C/total_count)


kable(arrange(df_cuisine, desc(`percentage_A`))[1:20,],caption = "Top 20 A Percentages")
Top 20 A Percentages
cuisine count_A count_B count_C total_count percentage_A percentage_B percentage_C
Basque 1 0 0 1 1.0000000 0.0000000 0.0000000
Cajun 30 0 0 30 1.0000000 0.0000000 0.0000000
Czech 10 0 0 10 1.0000000 0.0000000 0.0000000
Fruits/Vegetables 16 0 0 16 1.0000000 0.0000000 0.0000000
Not Listed/Not Applicable 22 0 0 22 1.0000000 0.0000000 0.0000000
Nuts/Confectionary 13 0 0 13 1.0000000 0.0000000 0.0000000
Scandinavian 14 0 0 14 1.0000000 0.0000000 0.0000000
Soups 12 0 0 12 1.0000000 0.0000000 0.0000000
Soups & Sandwiches 143 3 0 146 0.9794521 0.0205479 0.0000000
Other 359 9 3 371 0.9676550 0.0242588 0.0080863
Portuguese 29 1 0 30 0.9666667 0.0333333 0.0000000
Pancakes/Waffles 56 1 1 58 0.9655172 0.0172414 0.0172414
Salads 183 7 0 190 0.9631579 0.0368421 0.0000000
Hotdogs 76 3 0 79 0.9620253 0.0379747 0.0000000
Ice Cream, Gelato, Yogurt, Ices 787 27 6 820 0.9597561 0.0329268 0.0073171
Café/Coffee/Tea 3944 150 31 4125 0.9561212 0.0363636 0.0075152
Steak 271 9 4 284 0.9542254 0.0316901 0.0140845
Sandwiches/Salads/Mixed Buffet 598 28 4 630 0.9492063 0.0444444 0.0063492
Hamburgers 1055 48 10 1113 0.9478886 0.0431267 0.0089847
Donuts 1114 55 8 1177 0.9464741 0.0467290 0.0067969
kable(arrange(df_cuisine, desc(`percentage_C`))[1:20,],caption = "Top 20 C Percentages")
Top 20 C Percentages
cuisine count_A count_B count_C total_count percentage_A percentage_B percentage_C
Polynesian 2 1 1 4 0.5000000 0.2500000 0.2500000
Iranian 8 0 1 9 0.8888889 0.0000000 0.1111111
Bangladeshi 107 21 8 136 0.7867647 0.1544118 0.0588235
Australian 51 2 2 55 0.9272727 0.0363636 0.0363636
Creole 63 18 3 84 0.7500000 0.2142857 0.0357143
Peruvian 198 36 8 242 0.8181818 0.1487603 0.0330579
African 174 52 7 233 0.7467811 0.2231760 0.0300429
Asian 876 113 27 1016 0.8622047 0.1112205 0.0265748
Korean 672 110 21 803 0.8368618 0.1369863 0.0261519
Soul Food 131 20 4 155 0.8451613 0.1290323 0.0258065
Latin (Cuban, Dominican, Puerto Rican, South & Central American) 2460 448 77 2985 0.8241206 0.1500838 0.0257956
Moroccan 33 5 1 39 0.8461538 0.1282051 0.0256410
Vegetarian 284 25 8 317 0.8958991 0.0788644 0.0252366
Indian 838 113 24 975 0.8594872 0.1158974 0.0246154
Spanish 1579 264 45 1888 0.8363347 0.1398305 0.0238347
Caribbean 1761 289 50 2100 0.8385714 0.1376190 0.0238095
Tapas 83 6 2 91 0.9120879 0.0659341 0.0219780
Mexican 2374 282 57 2713 0.8750461 0.1039440 0.0210100
Middle Eastern 439 38 10 487 0.9014374 0.0780287 0.0205339
Pakistani 82 14 2 98 0.8367347 0.1428571 0.0204082
tidyv_cuisine <- df_cuisine[,c('cuisine','percentage_A','percentage_B','percentage_C','total_count')]
tidyv_cuisine <- tidyv_cuisine %>% rename(A = percentage_A,B = percentage_B,C = percentage_C)

tidyv_cuisine_top_counts <- arrange(tidyv_cuisine, desc(`total_count`))[1:20,]

tidyv_cuisine_top_counts$cuisine <- factor(tidyv_cuisine_top_counts$cuisine,
                                   levels=tidyv_cuisine_top_counts$cuisine[order(-tidyv_cuisine_top_counts$A)])

levels(tidyv_cuisine_top_counts$cuisine)[levels(tidyv_cuisine_top_counts$cuisine)=="Latin (Cuban, Dominican, Puerto Rican, South & Central American)"] <- "Latin" # for better display

tidyv_cuisine_top_counts <- tidyv_cuisine_top_counts %>% gather(key=GradeType,value=Percentage,-cuisine,-total_count)

tidyv_cuisine_top_A <- arrange(tidyv_cuisine, desc(`A`))[1:20,]

tidyv_cuisine_top_A$cuisine <- factor(tidyv_cuisine_top_A$cuisine,
                              levels=tidyv_cuisine_top_A$cuisine[order(-tidyv_cuisine_top_A$A)])

tidyv_cuisine_top_A <- tidyv_cuisine_top_A %>% gather(key=GradeType,value=Percentage,-cuisine,-total_count)

tidyv_cuisine_top_C <- arrange(tidyv_cuisine, desc(`C`))[1:20,]

tidyv_cuisine_top_C$cuisine  <- factor(tidyv_cuisine_top_C$cuisine,
                              levels=tidyv_cuisine_top_C$cuisine[order(-tidyv_cuisine_top_C$C)])

levels(tidyv_cuisine_top_C$cuisine)[levels(tidyv_cuisine_top_C$cuisine)=="Latin (Cuban, Dominican, Puerto Rican, South & Central American)"] <- "Latin"

tidyv_cuisine_top_C <- tidyv_cuisine_top_C %>% gather(key=GradeType,value=Percentage,-cuisine,-total_count)


ggplot(tidyv_cuisine_top_counts, aes(x=cuisine,y=Percentage,fill=GradeType)) + 
          geom_bar(stat="identity") +
          coord_flip()+labs(title="Top 20 Common Restaurants' Grade Percentages")

ggplot(tidyv_cuisine_top_A, aes(x=cuisine,y=Percentage,fill=GradeType)) + 
          geom_bar(stat="identity") +
          coord_flip()+labs(title="Top 20 A Percentages")

ggplot(tidyv_cuisine_top_C, aes(x=cuisine,y=Percentage,fill=GradeType,order_by())) + 
          geom_bar(stat="identity") +
          coord_flip()+labs(title="Top 20 C Percentages")

The first plot is the grade percentages for the top 20 types of restaurants with the highest total grade counts. The second plot is the grade percentages for the top 20 restaurant types that received the highest percentages of A. And the third plot is the grade percentages for the top 20 restaurant types that received the highest percentages of C.

We can see that among the restaurant types with the top total counts, Cafe, Hamburgers and Donuts are the top 3 in receiving A’s even though none of the made to the top 20 list of receiving A’s among all restaurant types. We can also see that there are several that have 100% A’s, including Cajun, Czech, Fruits/Vegetables, Not Listed/Not Applicable, Nuts/Confectionary, Scandinavian and Soups. However, if we look closer on the tables above, these types of restaurants actually have a very small sample size (<=30). Therefore, it’s possible that them having very few samples made it less likely to have that one bad sample in order to achieve 100% A’s. Among the top 20 C list, we see that Polynesian, Iranian, and Bangladeshi made the top 3. However, if we look at the table again, we see that both Polynesian and Iranian have a sample less than 10. So again, this tells us that they may not be representative. And in general, the top A list consists of European food while the top C list is a mixture of Latin, Asian, African, Caribbean, Australian, and Spanish restaurants.

In the following sections, we will dive deeper into different violations, scores, and some of our most frequented restaurants.

Violations

## [1] 354841
convert_to_grade <- function(x){
  if (x < 14){
    return("A")
  }
  else if(x > 28){
    return("C")
  }
  else {
    return("B")
  }
}
grade <- sapply(df$score, convert_to_grade)
df$grade <- grade

Combining the cuisine descriptions

df$cuisine <- factor(df$cuisine)
levels(df$cuisine) <- sub("Pizza/Italian", "Italian", levels(df$cuisine))
levels(df$cuisine) <- sub("Pizza", "Italian", levels(df$cuisine))
levels(df$cuisine) <- sub("Café/Coffee/Tea", "Dessert", levels(df$cuisine))
levels(df$cuisine) <- sub("Fruits/Vegetables", "Salads",levels(df$cuisine))
levels(df$cuisine) <- sub("Hotdogs/Pretzels", "Hotdogs", levels(df$cuisine))
levels(df$cuisine) <- sub("Ice Cream, Gelato, Yogurt, Ices", "Dessert", levels(df$cuisine))
levels(df$cuisine) <- sub("Juice, Smoothies, Fruit Salads", "Salads", levels(df$cuisine))
levels(df$cuisine) <- sub("Latin (Cuban, Dominican, Puerto Rican, South & Central American)",
                            "Latin", levels(df$cuisine))
levels(df$cuisine) <- sub("Sandwiches/Salads/Mixed Buffet", "Sandwiches", levels(df$cuisine))
levels(df$cuisine) <- sub("Bottled beverages, including water, sodas, juices, etc.", "Dessert", levels(df$cuisine))
levels(df$cuisine) <- sub("Not Listed/Not Applicable", "Other", levels(df$cuisine))
levels(df$cuisine) <- sub("Latin (Cuban, Dominican, Puerto Rican, South & Central American)", "Latin", levels(df$cuisine))
levels(df$cuisine) <- sub("Soups & Sandwiches", "Sandwiches", levels(df$cuisine))

Because some of the restaurant types are too specific, we decide to combine some of the categories to make it more simple and interpretable. As a result, all the restaurants with restaurant types that include word Pizza are categorized as Italian restaurants; all the cafes, stores that sell ice cream, drinks are categorized as Dessert; the restaurants with the restaurant type of not listed/not applicable are categorized as others, etc.

df$score <- as.numeric(df$score)
ggplot(df, aes(score)) + geom_histogram(binwidth = 5, boundary = 0, color = "black", fill = "blue") + labs(title = "Distribution of scores")

ggplot(df, aes(score)) + geom_histogram(binwidth = 5, boundary = 0, color = "black", fill = "blue") + labs(title = "Distribution of scores based on borough") +
  facet_grid(boro~., scale = "free")

The plot shown above describes the overall pattern of the scores. It is shown that the distribution of scores is skewed to the right. While it does not look reasonable to have a negative score, some of the restaurants receive the negative score (This is why the histogram doesn’t start from 0). The scores that the restaurants mostly received for any type of violations are between 10 to 20. Also, there exist quite a lot of restaurants that receive a score higher than 25 for any type of violations. Because the score above 25 is associated with grade C, which is bad, we could probably consider such restaurants as not cleaned and not well maintained.

Also, we plot the distribution of scores based on boroughs to explore the relationship between borough and scores. Because we are interested in the distribution for each borough, we set the scale to free to make sure the distributions for some boroughs do not shrink. While there is a slight change in the distribution of scores across boroughs, they are all highly skewed to the right, and the shape of the distributions looks almost the same. This shows that there is no or little relationship between the boroughs and the scores.

# grade distribution for each violation code
violation_df <- df %>% select(violation_code, grade) %>% group_by(violation_code, grade) %>% summarize(count = n())
violation_df <- violation_df %>% filter(!is.na(violation_code))
violation_df = df %>% select(violation_code) %>% group_by(violation_code) %>% summarize(count = n())  %>% arrange(-count)

ggplot(violation_df, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
  labs(title = "Distribution of violation code", x = "violation code")

violation_df_top10 <- violation_df[1:10, ]

violation_des <- violation[violation$VIOLATION.CODE %in% violation_df_top10$violation_code, ]
violation_des$VIOLATION.CODE <- factor(violation_des$VIOLATION.CODE, levels = violation_df_top10$violation_code)
violation_des <- violation_des %>% arrange(VIOLATION.CODE)
print(violation_des)
##    VIOLATION.CODE
## 1             10F
## 2             08A
## 3             04L
## 4             06C
## 5             06D
## 6             02G
## 7             10B
## 8             02B
## 9             04N
## 10            04H
##                                                                                                                                                                                                                                                                                   VIOLATION.DESCRIPTION
## 1                       Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2                                                                                                                                                                    Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3                                                                                                                                                                                                                       Evidence of mice or live mice present in facility's food and/or non-food areas.
## 4                                                                                                                                                                            Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 5                                                                                                                                                        Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 6                                                                                                                                                      Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 7                                                                    Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 8                                                                                                                                                                                                                                                    Hot food item not held at or above 140Ã\u0082º F.
## 9  Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 10                                                                                                                                                                       Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan.
ggplot(violation_df_top10, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
  labs(title = "Distribution of top 10 violation code", x = "violation code")

The most commonly violated violation type is 10F, ‘Non-food contact surface improperly constructed’. While this violation seems not too serious, the 08A and 04L are associated with vermin and mice, which sound more disgusting and serious. Moreover, 06C is the violation code that is ‘Food not protected from potential source of contamination during storage, preparation, transportation, display or service.’ Learning that there are restaurants in New York which have issue with vermin and mice while not properly storing and protecting foods from any potential source of contamination may suggest us to be very careful and picky when deciding which restaurant to go.

#getting the number of restaurants for each restaurant type
distinct_res <- df %>% select(cuisine, name) %>% distinct(cuisine, name) %>% mutate(count = 1)
distinct_res <- distinct_res %>% select(cuisine, count) %>% group_by(cuisine) %>% summarise(total_num = sum(count)) %>% arrange(-total_num)

ggplot(distinct_res, aes(reorder(cuisine, total_num), total_num)) + geom_col() + coord_flip() +
  labs(title = "Number of restaurants", x = "Cuisine")

top5res <- distinct_res[1:5, ]
ggplot(top5res, aes(reorder(cuisine, total_num), total_num)) + geom_col() + coord_flip() +
  labs(title = "Number of restaurants", x = "Cuisine")

The bar chart shows the number of restaurants for each restaurant type. Because there are more than 50 restaurant types in this data, we decide to focus on 5 most common restaurant types. The top 5 restaurant types are following: American, Italian, Chinese, Dessert, and Latin. As shown in the plot above, the most common restaurant type is American (5310), followed by Italian (2224) and Chinese (2088).

# getting the information associated with 5 most common restaurant types
top5 <- df[df$cuisine %in% top5res$cuisine, ]
top5 <- na.omit(top5)
top5$cuisine <- factor(top5$cuisine)

# group by violation code to see the most commonly violated violation codes
top5_violations = top5 %>% select(violation_code) %>% group_by(violation_code) %>% summarize(count = n())  %>% arrange(-count)
ggplot(top5_violations, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
  labs(title = "Distribution of violation code", x = "violation code")

violations_10mostcommon <- top5_violations[1:10, ]

new_violation <- violation[violation$VIOLATION.CODE %in% violations_10mostcommon$violation_code, ]
new_violation$VIOLATION.CODE <- factor(new_violation$VIOLATION.CODE, levels = violations_10mostcommon$violation_code)
new_violation <- new_violation %>% arrange(VIOLATION.CODE)
print(new_violation)
##    VIOLATION.CODE
## 1             10F
## 2             08A
## 3             06D
## 4             04L
## 5             06C
## 6             02G
## 7             10B
## 8             02B
## 9             04N
## 10            04H
##                                                                                                                                                                                                                                                                                   VIOLATION.DESCRIPTION
## 1                       Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2                                                                                                                                                                    Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3                                                                                                                                                        Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 4                                                                                                                                                                                                                       Evidence of mice or live mice present in facility's food and/or non-food areas.
## 5                                                                                                                                                                            Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 6                                                                                                                                                      Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 7                                                                    Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 8                                                                                                                                                                                                                                                    Hot food item not held at or above 140Ã\u0082º F.
## 9  Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 10                                                                                                                                                                       Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan.
ggplot(violations_10mostcommon, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
  labs(title = "Distribution of 10 most common violation code for 5 most common restaurant types", x = "violation code")

We only focus on the restaurants from 5 most common restaurant types to explore what type of violations the restaurants made the most. The above plot shows 10 most common violations made by these restaurants. The violation code that was most frequently appeared is 10F, which is ‘Non-food contact surface improperly constructed. Unacceptable material used’ While this violation does not sound very critical, 04L is associated with mice, which sounds more serious and critical. Out of 93 violations, this one has ranked at top 7 and is violated 4473 times. Because the above plot indicates the overall frequency of violations, we decide to explore the frequency of the violation codes for specific restaurant type. Again, as discussed above, we focus only on the violations made by restaurants from 5 most common restaurant types.

# this funciton takes the string (restaurant type) as an input and returns a bar chart and a dataframe that contains the information about the violation codes and the corresponding violation descriptions

violation_code_with_restype <- function(x){
  new_df <- df %>% filter(cuisine == x) %>% select(violation_code) %>% group_by(violation_code) %>%
    summarize(count = n()) %>% arrange(-count) %>% top_n(10)
  
  new_violation <- violation[violation$VIOLATION.CODE %in% new_df$violation_code, ]
  new_violation$VIOLATION.CODE <- factor(new_violation$VIOLATION.CODE, levels = new_df$violation_code)
  new_violation <- new_violation %>% arrange(VIOLATION.CODE)
  print(new_violation)
  
  ggplot(new_df, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
    labs(title = paste("Distribution of violation code of", x, "restaurants", sep = " "), x = "violation code")
}

We first focus on the violation types that American restaurants made. In order to produce a bar chart of 10 most common violation made by these restaurants, the steps we take are following: 1. we first filter out the rows with restaurant type = American 2. Then, we group the dataframe by the violation code and then count the number of each violation has made 3. we arrange the dataframe by the number of the occurrence and then extract top 10 violation codes to produce a bar chart The following steps are used to draw a plot for any restaurant type described below.

While the frequency of the violation code changes (because we are only looking at one specific type of the restaurants), the most commonly violated violation types stay almost the same. The 4 most common violation types are same as the plot shown before this one. Though, these restaurants seem to violate 02G, which associates with the storing cold food item above 41 Farenheit, more often than 10B, which associates with not properly installing or maintaining plumbing. Also, they tend to violate the rule associated with storing and maintaining foods as their 10 most violated code is ‘Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan’

violation_code_with_restype("Italian")
## Selecting by count
##    VIOLATION.CODE
## 1             10F
## 2             08A
## 3             04L
## 4             06D
## 5             02G
## 6             02B
## 7             06C
## 8             10B
## 9             04N
## 10            04A
##                                                                                                                                                                                                                                                                                   VIOLATION.DESCRIPTION
## 1                       Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2                                                                                                                                                                    Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3                                                                                                                                                                                                                       Evidence of mice or live mice present in facility's food and/or non-food areas.
## 4                                                                                                                                                        Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 5                                                                                                                                                      Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 6                                                                                                                                                                                                                                                    Hot food item not held at or above 140Ã\u0082º F.
## 7                                                                                                                                                                            Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 8                                                                    Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 9  Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 10                                                                                                                                                                                                                               Food Protection Certificate not held by supervisor of food operations.

Next, we decide to explore the violation types the Italian restaurants have made. Comparing all three plots, it seems that the restaurants of different types tend to make similar violations. The 10 most common violation types for Italian restaurants are exactly same as the American restaurants except that the orders of violation codes (in terms of frequency) changes. The Italian restaurants in New York area seem to violate the rule associated with installing and maintaining plumbing properly more than the American restaurants in New York. Moreover, compared to the overall frequency of violation codes, these two types of restaurants seem to have more issue with mice.

violation_code_with_restype("Chinese")
## Selecting by count
##    VIOLATION.CODE
## 1             10F
## 2             08A
## 3             06C
## 4             04L
## 5             02B
## 6             02G
## 7             06D
## 8             10B
## 9             04M
## 10            04N
##                                                                                                                                                                                                                                                                                   VIOLATION.DESCRIPTION
## 1                       Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2                                                                                                                                                                    Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3                                                                                                                                                                            Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 4                                                                                                                                                                                                                       Evidence of mice or live mice present in facility's food and/or non-food areas.
## 5                                                                                                                                                                                                                                                    Hot food item not held at or above 140Ã\u0082º F.
## 6                                                                                                                                                      Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 7                                                                                                                                                        Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 8                                                                    Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 9                                                                                                                                                                                                                                        Live roaches present in facility's food and/or non-food areas.
## 10 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.

The 10 most common violation types for Chinese restaurants seem to be slightly different from those of the previous two restaurant types. The violation of 04L (Evidence of mice or live mice present in facility’s food and/or non-food areas) is more frequent (in terms of relative frequency). Moreover, the 04M code is violated more than 1000 times and is ranked at top 9. This one is associated with live roaches present in facility’s food and/or non-food areas. So far, our results on the different types of restaurants show us that it would probably be a great idea to look at this report first to avoid the restaurants with roaches and mice.

violation_code_with_restype("Dessert")
## Selecting by count
##    VIOLATION.CODE
## 1             10F
## 2             08A
## 3             04L
## 4             06D
## 5             06C
## 6             10B
## 7             04A
## 8             06E
## 9             04N
## 10            04H
##                                                                                                                                                                                                                                                                                   VIOLATION.DESCRIPTION
## 1                       Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2                                                                                                                                                                    Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3                                                                                                                                                                                                                       Evidence of mice or live mice present in facility's food and/or non-food areas.
## 4                                                                                                                                                        Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 5                                                                                                                                                                            Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 6                                                                    Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 7                                                                                                                                                                                                                                Food Protection Certificate not held by supervisor of food operations.
## 8                                                                                                                                                                                                  Sanitized equipment or utensil, including in-use food dispensing utensil, improperly used or stored.
## 9  Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 10                                                                                                                                                                       Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan.

violation_code_with_restype("Latin (Cuban, Dominican, Puerto Rican, South & Central American)")
## Selecting by count
##    VIOLATION.CODE
## 1             10F
## 2             08A
## 3             02B
## 4             04L
## 5             06C
## 6             04N
## 7             02G
## 8             10B
## 9             06D
## 10            04M
##                                                                                                                                                                                                                                                                                   VIOLATION.DESCRIPTION
## 1                       Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2                                                                                                                                                                    Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3                                                                                                                                                                                                                                                    Hot food item not held at or above 140Ã\u0082º F.
## 4                                                                                                                                                                                                                       Evidence of mice or live mice present in facility's food and/or non-food areas.
## 5                                                                                                                                                                            Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 6  Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 7                                                                                                                                                      Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 8                                                                    Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 9                                                                                                                                                        Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 10                                                                                                                                                                                                                                       Live roaches present in facility's food and/or non-food areas.

The bar chart of the violation codes for Latin and Dessert restaurants look almost the same as other three types of restaurants. The violation associated with the evidence of mice or live mice present in facility’s food and/or non-food areas have frequently made in both types of restaurants compared to other violations. Moreover, Latin restaurants have the issue of roaches as well. The Dessert type restaurants tend to violate the code associated with sanitizing utensils and washing food contact surface more often. Though, compared to the overall (containing all the restaurants) frequency of the violation codes, the code associated with the mice one is ranked lower for these types of restaurants.

Also, we decide to study the association between the violation codes and the years. In other words, we decide to explore more to see if there are any trends in the types of violations across the years. We still focus on the 5 most common restaurants to keep it simple and more interpretable. In order to produce a plot for each year, we take the same steps as described above except that we create new columns called year and month which extract year and month information from inspection date using mutate function.

Additional note: we decide to focus on 08A, 06D, 04L, and 06C only. While including all the violation codes provide more information about what overall violation trends across the year, this can make the graph more complicated and hard to look at. So, to make it simple and more readable, we choose 08A, 06D, 04L, and 06C as our x variable to see if there are any trends in these violation types. Again, because we are most interested in the shape of the distribution for each type of restaurant and each year, we set the scale to free to prevent some of the graphs from shrinking towards 0.

violation_type <- c("08A", "06D", "04L", "06C")
top5_year <- top5 %>% filter(violation_code %in% violation_type) %>% mutate(year = factor(format(inspection_date, "%Y")), month = factor(format(inspection_date, "%m"))) %>% select(cuisine, year, violation_code) %>% group_by(cuisine, year, violation_code) %>% summarise(count = n())


ggplot(top5_year, aes(x = violation_code, y = count, fill = year)) +
  geom_col(position = "dodge") +
  facet_grid(cuisine~., scale = "free") +
  ggtitle("Trends in frequency of violation types") +
  theme_grey(16)

One noticeable feature from this graph is that there are not much data in 2013. In fact, it is also shown in the dataframe that the frequency of any violation types for 2013 is less than 10. The frequency of 08A is highest in the year of 2017 for American, Dessert, Chinese, and Italian. Only the plot of Latin restaurants has the highest frequency of 08A in the year of 2015. There is increasing pattern in any violation code for Chinese and Dessert restaurants. All of the violation code reached its peak in the year of 2017. It then decreases significantly at 2018, but it would probably be because the inspections for 2018 has not finished yet (since it is just April). While there is an increasing pattern in 04L, 06D, and 08A for Italian and American restaurants, it seems like the frequency of the violations of 06C (Food not protected from potential source of contamination during storage, preparation, transportation, display or service) has decreased after 2016. For Latin restaurants, the frequency of all four type of violations increase at 2015 and then decrease after 2016. Also, the Latin restaurants are the only one which has a decreasing trend in 04L that is associated with mice.

Restaurants by case

# Top 10 Resteraurants Ranked by Number of Critical Violation
# Regardless of type of violation

# group data
restaurant <- inspecttbl%>%group_by(INSPECTION.DATE, DBA, SCORE, GRADE, CRITICAL.FLAG,BORO, BUILDING, STREET, ZIPCODE)
# renaming DBA(doing business as usual) to restaurant
restaurant<- restaurant%>%rename( RESTAURANT = DBA)
# Count voilations according to restaurant and type of violation [~distinct] count the instances regardless of violation type per score, per restaurant, per date
count<-restaurant%>%summarise(count=n())

# select only critical violation
citical_count<-count%>%filter(CRITICAL.FLAG=='Critical')

# Find the worest restaurants by calculating the number of that violate with Grade C under critical situation
citical_count<-citical_count%>%filter(SCORE>27)%>%group_by(RESTAURANT,STREET)%>%mutate(violation_count=sum(count))


# order the data # restaurant the receives violates grade C most
res_critical_vio_count<-distinct(citical_count%>%select(STREET,RESTAURANT,violation_count)%>%arrange(desc(violation_count)))
result<-res_critical_vio_count[1:10,c(2,3)]
result$RESTAURANT<-reorder(result$RESTAURANT,result$violation_count)
g<-ggplot(result, aes(RESTAURANT, violation_count)) +
  geom_bar(stat = 'identity',fill="tan2")  +
  coord_flip()+
  #xlabel,ylabel 
  labs(title="10 Worst Resteraurants Ranked by Number of \n Critical Violation ",x = "", y = "Critical Violation Count")
g

#ggplotly(g,tooltip=c("x", "y"))

We want to evaluate the restaurants in NYC inspection and find the worst restaurants in NYC. We define two ways to evaluate how bad the restaurant is, by violation incidents and by score of those violation. For the first method, we count the number of violation incident with grade C under critical condition [CRITICAL.FLAG==‘Critical’] for each restaurant. Across all records, we plot the 10 restaurants with highest violation count. Overall, there are 5 restaurants have critical violation counts more than 40. Note that the first one ‘Gang San Deul’ receives over 45 violation that is marked critical. We can see that some of them are Asian Cuisine from the restaurant names.

# restaurant have the highest violation score
# Define bad restaurants as violation  Score > 27 [Grade C] ~ 5 occurance of Grade C
citical_count_score<-citical_count%>%
  filter(SCORE>27,count>5)%>%arrange(desc(SCORE))

result<-citical_count_score[1:10,c(2,3,10)]
result$RESTAURANT<-reorder(result$RESTAURANT,result$SCORE)

g<-ggplot(result, aes(RESTAURANT, SCORE)) +
  geom_bar(stat = 'identity',fill="tan2")  +
  coord_flip()+
  #theme(text = element_text(size=10), axis.text.x = element_text(angle=45, vjust=1, color = "grey")) +
  labs(title="Worst 10 Highest Violation Score Restaurant ",x = "", y = "Score")

#ggplotly(g)
g

# save the csv for google api
# csv<-citical_count_score
# write.csv(csv[1:15,c(1,4,2,3,10)], file = '/Users/Selina/Desktop/Visualization/EDAV-master/chart.csv',row.names = FALSE)

For the second method, we want to give particular emphasis on the times of violation occurs for a restaurant during one inspection. We take into account of violations that have grade C [with score > 27] and have occurred more than 5 times for a restaurant. In the plot, we find that all of the restaurants have score above 100. In other words, these restaurants have at least 5 violation scored larger than 100 at the same date of inspection! Note that ’Margarita Island’ have the highest violation score [151] among all the restaurants.

Google chart is a nice API for visualization in javascript, base on the what we learned this semester on d3, we decide to explore this API a bit. Click here to see the interactive google chart version of the plot.

#### INSPECTION.TYPE Violation related####
# we focus on instore sanity check of the inspection#
# dis-regarding rows from initial inspections as grades and scores are not counted
gradables <- c("Cycle Inspection / Initial Inspection" , "Cycle Inspection / Re-inspection", 
               "Pre-permit (Operational) / Initial Inspection", "Pre-permit (Operational) / Re-inspection")

inspecttbl1 <- (filter(inspecttbl, `INSPECTION.TYPE` %in% gradables))


# critical score and inspection type of chain restaurant
# store per year average critical score variation
cafe<-inspecttbl1%>%filter(CUISINE.DESCRIPTION=="Café/Coffee/Tea")%>%mutate(year = format(INSPECTION.DATE, "%Y"))

STARBUCKS<-cafe[grep("STARBUCKS",cafe$DBA),]%>%group_by(year,BORO)%>%summarise(MEAN = mean(SCORE))
STARBUCKS$DBA<-"STARBUCKS"

DUNKIN<-cafe[grep("DUNKIN*",cafe$DBA),]%>%group_by(year,BORO)%>%summarise(MEAN = mean(SCORE))
DUNKIN$DBA<-"DUNKIN"

PICCOLO<-cafe[grep("PICCOLO*",cafe$DBA),]%>%group_by(year,BORO)%>%summarise(MEAN = mean(SCORE))
PICCOLO$DBA<-"PICCOLO"

GREGORY<-cafe[grep("GREGORY'S COFFEE*",cafe$DBA),]%>%group_by(year,BORO)%>%summarise(MEAN = mean(SCORE))
GREGORY$DBA<-"GREGORY"

score<-rbind(STARBUCKS,DUNKIN,GREGORY,PICCOLO)
score_boro<-score%>%filter(BORO=="MANHATTAN"|BORO=="QUEENS"|BORO=="BROOKLYN")
score_boro$DBA<-as.factor(score_boro$DBA)

csv<-score_boro
# change to local path
# write.csv(csv, file = 'YOURPATH',row.names = FALSE)
#write.csv(csv, file = '/Users/Selina/Desktop/Visualization/EDAV-master/boro_mean.csv',row.names = FALSE)
# csv <- read_csv("YOURPATH")
#csv <- read_csv("/Users/Selina/Desktop/Visualization/EDAV-master/boro_mean.csv")

g<-ggplot(csv, aes(x = year, y = MEAN, fill = DBA)) + 
  geom_col(position = "dodge") +
  facet_grid(fct_relevel(score_boro$BORO,"MANHATTAN")~.) +
  ggtitle("Mean Violation Score") +
  theme_grey(16)

g+ scale_colour_colorblind()

ggplotly(g,tooltip=c("x", "y"))

We gather ‘Cafe’ category from all the restaurant types, and want to see the violation score difference among the cafes.

In this section, we want to examine how well those chain restaurants did in food safety. As mentioned by Julie, inspection types varies across the record. Again, we eliminated some of the inspections types as they are irrelevant of our goal, such as Trans Fat, Calorie type, Compliance, Special Program, and Administrative violations inspections.

Among all the Cafe, we pick out the 4 chain stores across U.S., including STARBUCKS, DUNKIN, GREGORY, and PICCOLO and evaluate the quality of those chain cafes store in NY. Sine all 4 of them have many local stores in NYC. It is easy to visualize according to the mean violation score of each base on Borough.

Among these 4 Cafes, GREGORY and PICCOLO stores are only located in Manhattan. And PICCOLO started to be recorded after 2015. Looking at the plot of Manhattan,

  1. we see that from 2015 to 2018, PICCOLO perform the worst and has the highest mean violation score.
  2. Score of DUNKIN fluctuate from year to year. It has the lowest score 4 at 2014 (less than one half of GREGORY — who had the second lowest score); the mean score jumped to almost 20 at 2017.
  3. GREGORY have the most consistent scores across year.

For Brooklyn and Queens,

  1. DUNKIN in general has higher violation score in Brooklyn, comparing to STARBUCKS. However, it operated more according to the safety rules and had a lower mean score in QUEENS.
  2. STARBUCKS’s mean score ranges from 10 to 15 in all three boroughs.
# CAFE common violation type
# edit
STARBUCKS_type<-inspecttbl1[ grep("STARBUCKS",inspecttbl1$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n()/2431)%>%arrange(desc(vio_freq))
STARBUCKS_type$DBA<-"STARBUCKS"

interm<-inspecttbl1[ grep("DUNKIN*",inspecttbl1$DBA),]
interm$DBA<-"DUNKIN"
DUNKIN_type<-interm%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n()/6004)%>%arrange(desc(vio_freq))
DUNKIN_type$DBA<-"DUNKIN"

PICCOLO_type<-inspecttbl1[ grep("PICCOLO*",inspecttbl1$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n()/339)%>%arrange(desc(vio_freq))
PICCOLO_type$DBA<-"PICCOLO"

GREGORY_type<-inspecttbl1[ grep("GREGORY'S COFFEE*",inspecttbl1$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n()/98)%>%arrange(desc(vio_freq))
GREGORY_type$DBA<-"GREGORY"

type<-rbind(STARBUCKS_type[1:5,], DUNKIN_type[1:5,],PICCOLO_type[1:5,],GREGORY_type[1:5,])
type$DBA<-factor(type$DBA)
type<- type%>%rename( ViolationCode = VIOLATION.CODE,frequency=vio_freq)
g<-ggplot(type, aes(x = ViolationCode, y = frequency)) +
  geom_col(position = "dodge",fill="tan2") +
  facet_grid(~ fct_relevel(type$DBA,"STARBUCKS"),scales = 'free')+
  ggtitle("Most Frequent Violation Code for Cafes") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))+
  theme_grey(16)

g<- ggplotly(g,tooltip=c("x", "y"))
g

After seeing how mean violation score of those chain cafe varying according to year and borough, we look at the most frequent violation code for each of them.

10F ‘Non-food contact surface improperly constructed’ appears in Top 5 most frequent violation for every Cafe. It makes sense since it is the most common violation in NYC. Notice that in STARBUCKS, it really stands out from all the violation and count for about 30% of all of STARBUCKS stores records. The result is good, since the violation is relatively trivial comparing to 04 series categories (eg. 04N the evidence of flies etc.) Whereas in DUNKIN and GREGORY, 10F counts for over 15% of the violation records.

In the case of PICCOLO [which have the highest violation score], the top 5 violation only counts very small portions of overall violation.

# select code c('10F','10B','08A') 
# 08A vermin
# 04N flies
code<-inspecttbl1%>%filter(VIOLATION.CODE=='10F'|VIOLATION.CODE=='10B'|VIOLATION.CODE=='08A')%>%mutate(year = format(INSPECTION.DATE, "%Y"),month = format(INSPECTION.DATE, "%Y"))
STARBUCKS_type<-code[ grep("STARBUCKS",code$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n())%>%arrange(desc(vio_freq))
STARBUCKS_type$DBA<-"STARBUCKS"

DUNKIN_type<-code[ grep("DUNKIN*",code$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n())%>%arrange(desc(vio_freq))
DUNKIN_type$DBA<-"DUNKIN"

PICCOLO_type<-code[ grep("PICCOLO*",code$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n())%>%arrange(desc(vio_freq))
PICCOLO_type$DBA<-"PICCOLO"

GREGORY_type<-code[ grep("GREGORY'S COFFEE*",code$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n())%>%arrange(desc(vio_freq))
GREGORY_type$DBA<-"GREGORY"

code<-rbind(STARBUCKS_type[1:5,], DUNKIN_type[1:5,],PICCOLO_type[1:5,],GREGORY_type[1:5,])
code$DBA<-factor(code$DBA)
code$DBA<-fct_relevel(code$DBA,"STARBUCKS")
# code$DBA<-fct_relevel(code$DBA,"PICCOLO", )
mosaic(VIOLATION.CODE ~ DBA, code, direction = c("v", "h"),labeling= labeling_border(rot_labels = c(15,0,0,0)),main="Violation Code Ratio of Cafes") #gp = gpar(fill = c("blue", "lightblue"),rot_labels=c(0,90,0,0),just_labels="right" )

#, labeling= labeling_border(rot_labels = c(0,90,0,0), 
                                  # just_labels = c("center", 
                                      #           "center", 
                                       #          "center", 
                                       #           "right"))

We find the 10F[non-food contact surface improperly constructed], 10B[improper sewage disposal system], 08A[Facility not vermin proof and have potential to attract vermin] are the most frequent violation code combination across these 4 stores.

A close look at their distribution by store would be interesting. From the plot, we can see that STARBUCKS and DUNKIN has very similar violation ratio distribution, whereas GREGORY have more violation in 10B , and less in 10F and 08A. PICCOLO has a significant increase in 08A.

# per year/date
yearmonth<-inspecttbl1%>%filter(CUISINE.DESCRIPTION=="Café/Coffee/Tea")%>%mutate(yearmonth = format(INSPECTION.DATE, "%Y-%m"))
STARBUCKS<-yearmonth[grep("STARBUCKS",yearmonth$DBA),]%>%group_by(yearmonth)%>%filter(VIOLATION.CODE=="08A")%>%summarise(Score = mean(SCORE))
STARBUCKS$Date<-as.Date(as.yearmon(STARBUCKS$yearmonth, "%Y-%m"))
g<-ggplot(STARBUCKS) + geom_line(aes(Date, Score)) +
  geom_point(aes(Date, Score))+
    ggtitle("STARBUCKS Violation Score 08A Monthly Mean Score Variation") +
    labs (x = "Month", y = "Score") +
    theme_grey(16) +
    theme(legend.title = element_blank())
p <- ggplotly(g, tooltip=c("x", "y"))
p

As Lisa has shown in the previous section, 10F and 08A are two most frequent violation in NYC. We decided to look at 08A violation particular in STARBUCKS since it represents the dimension we cared about more or more serious in food safety issue of the city - ‘potential presence of vermin (eg. mice).’

The time series plot according to ‘potential vermin presents [08A]’ violation shows the monthly change of score for STARBUCKS the past 5 years. There is a general pattern of decreasing in score across years. And the variation of score also decreases, meaning that all STARBUCKS in NYC are reaching more and more consistent food safety quality maybe through stricter regularization.

# analysis of area
CAFE<-inspecttbl1%>%filter(CUISINE.DESCRIPTION=="Café/Coffee/Tea")%>%mutate(year = format(INSPECTION.DATE, "%Y"))

STARBUCKS<-CAFE[grep("STARBUCKS",CAFE$DBA),]%>%group_by(BORO,year)%>%filter(VIOLATION.CODE=="08A")%>%summarise(Score = mean(SCORE))
STARBUCKS<-STARBUCKS%>%filter(year!="2018",year!="2014")
STARBUCKS$Boro<-reorder(STARBUCKS$BORO,STARBUCKS$Score)
g<-ggplot(STARBUCKS) +
  geom_bar(aes(Boro, Score),stat = 'identity',fill="tan2")  +
  facet_grid(~year)+
  labs(title="Mean Violation Code 08A By BORO ",x = "NYC Borough", y = "08A Violation Score")+
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
g <- ggplotly(g,tooltip=c("x", "y"))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
g

We are wondering if this violation score is different across different area of NYC as well. The Mean Violation Code By Boro shows the following trend. Since the data in 2014 and 2018 are not recorded full-year around. We take 2015, 2016, 2017 to analysis the overall trend.

  1. Potential presence of vermin were consistent across boroughs in 2015. Score is about 15.
  2. STARBUCKS in Bronx improved the quality in this area and become the best across NYC for 2016 and 2017.
  3. A large portion of the STARBUCKS are located in Manhattan, comparing to anywhere else. Those STARBUCKS did okay not the best in keeping restaurant tidy but not the worst either.
  4. BROOKLYN STARBUCKS seemed to be the ones that have the worst condition across all three years.

Executive Summary

(Provide a short nontechnical summary of the most revealing findings of your analysis written for a nontechnical audience. The length should be approximately two pages (if we were using pages…) Take extra care to clean up your graphs, ensuring that best practices for presentation are followed.)

Conclusion